v1.0: rewrite parsing backend on pypdfium2 #124
Open
codereverser wants to merge 5 commits into
Codecov Report
❌ Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## main #124 +/- ##
==========================================
+ Coverage 88.66% 97.04% +8.39%
==========================================
Files 18 19 +1
Lines 1463 2295 +832
==========================================
+ Hits 1297 2227 +930
+ Misses 166 68 -98
Rebuilds the parsing layer for v1.0 on top of pypdfium2 (Apache-2.0 / BSD-3) so casparser ships pure MIT end-to-end; the prior pdfminer.six + PyMuPDF dependencies are dropped along with the entire `casparser/process/` regex-tokenisation pipeline they fed.

Engine (casparser/parsers/extract.py, pageobj.py)
=================================================

`extract.py` walks PDF page objects (one atom per text-show op), maps glyphs to their parent atom via PDFium's `FPDFText_GetTextObject`, deduplicates same-font overlapping atoms, then emits `Char`/`Line`/`Page` shaped output that downstream parsers consume.

Atom-level dedup replaces all per-character overlay heuristics: when two atoms share a font, x-overlap by >=50% of the narrower atom's width, and sit 0.05-3.0pt apart in y, we drop the one further from the row's median baseline. That handles the date-twin artefact (same date column rendered twice with a small y-offset, glyphs interleaving by x to produce garbage like `2020 -> 22002200`) without the multi-stage sub-cluster filters earlier prototypes used.

`pageobj.py` exposes the atoms + their column/block grouping that the NSDL/CDSL parsers operate on directly. The same Atom primitive backs the investor extractor.

Per-issuer parsers
==================

- `cams_detailed.py` / `cams_summary.py` consume the Line stream for CAMS + KFin DETAILED and SUMMARY templates.
- `nsdl.py` reads the page-2 account roster, walks per-account holdings sections (equities + mutual funds + corporate bonds in both summary and detailed forms). Section-aware routing handles the case where multiple holding types share the same 18-cell detailed table header by tracking `cur_section` from the preceding marker block. The page-2 roster accepts both the 4-cell (broker + DP/Client joined) and 5-cell (broker, then DP/Client) variants.
- `cdsl.py` mirrors NSDL's structure for the CDSL CAS template.
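
The atom-level dedup rule from the Engine section can be sketched roughly as follows. This is only an illustration under stated assumptions: the `Atom` dataclass, its field names, and the helper names are hypothetical stand-ins rather than the classes in `pageobj.py`, and the tie-break (keep the earlier atom when both are equally far from the median) is a guess.

```python
from dataclasses import dataclass
from statistics import median


@dataclass(frozen=True)
class Atom:
    """Hypothetical stand-in for one text-show op; field names are assumed."""
    text: str
    font: str
    x0: float
    x1: float
    y: float  # baseline


def x_overlap_ratio(a: Atom, b: Atom) -> float:
    """Horizontal overlap as a fraction of the narrower atom's width."""
    overlap = min(a.x1, b.x1) - max(a.x0, b.x0)
    narrower = min(a.x1 - a.x0, b.x1 - b.x0)
    return overlap / narrower if overlap > 0 and narrower > 0 else 0.0


def is_twin(a: Atom, b: Atom) -> bool:
    """Same font, >=50% x-overlap, and 0.05-3.0pt apart in y."""
    dy = abs(a.y - b.y)
    return a.font == b.font and x_overlap_ratio(a, b) >= 0.5 and 0.05 <= dy <= 3.0


def dedup_row(atoms: list[Atom]) -> list[Atom]:
    """From each twin pair, drop the atom further from the row's median baseline."""
    if not atoms:
        return []
    base = median(a.y for a in atoms)
    dropped: set[int] = set()
    for i, a in enumerate(atoms):
        for j in range(i + 1, len(atoms)):
            if i in dropped or j in dropped:
                continue
            b = atoms[j]
            if is_twin(a, b):
                # On a tie the later atom is dropped (an assumption of this sketch).
                loser = i if abs(a.y - base) > abs(b.y - base) else j
                dropped.add(loser)
    return [a for k, a in enumerate(atoms) if k not in dropped]


row = [
    Atom("01-Jan-2020", "F1", 10.0, 60.0, 100.0),
    Atom("01-Jan-2020", "F1", 10.5, 60.5, 101.2),  # the date twin, offset in y
    Atom("Purchase", "F1", 70.0, 120.0, 100.0),
    Atom("1,000.00", "F1", 130.0, 170.0, 100.0),
]
kept = dedup_row(row)
```

Because the decision is made per atom rather than per glyph, the interleaved `22002200`-style output never forms: the offset twin is removed before characters are sorted by x.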
Types
=====

- Adds `Bond` model with optional coupon_rate / coupon_frequency / maturity_date / face_value / market_price; required fields are isin, num_bonds, value. Surfaces on `DematAccount.bonds`.
- `investor_info` is now required on `CASData` and `NSDLCASData`.

Performance
===========

The dispatcher opens the PDF document exactly once per `read_cas_pdf` call and threads the handle through detect / parser / investor extractor via an `_doc=` kwarg. NSDL/CDSL additionally share the extracted atoms between the holdings parser and the investor extractor.
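
The open-once pattern from the Performance section might look something like the sketch below. Everything apart from the `_doc=` keyword is hypothetical: `FakeDoc`, `open_doc`, `use_doc`, the three stage functions, and `read_cas_pdf_sketch` are illustrative names, and the real code would hold a pypdfium2 document rather than a stub.

```python
from contextlib import contextmanager

open_calls = []  # records every real open, for illustration only


class FakeDoc:
    """Stand-in for the PDF handle; the real code would hold a pypdfium2 document."""

    def __init__(self, path):
        self.path = path
        self.closed = False

    def close(self):
        self.closed = True


def open_doc(path):
    open_calls.append(path)
    return FakeDoc(path)


@contextmanager
def use_doc(path, _doc=None):
    """Borrow the caller's handle if given; otherwise open one and close it afterwards."""
    if _doc is not None:
        yield _doc  # borrowed: the owner is responsible for closing it
        return
    doc = open_doc(path)
    try:
        yield doc
    finally:
        doc.close()


def detect_file_type(path, *, _doc=None):
    with use_doc(path, _doc) as doc:
        return f"type-of:{doc.path}"  # placeholder for real sniffing


def parse_holdings(path, *, _doc=None):
    with use_doc(path, _doc) as doc:
        return f"holdings-of:{doc.path}"  # placeholder for real parsing


def extract_investor(path, *, _doc=None):
    with use_doc(path, _doc) as doc:
        return f"investor-of:{doc.path}"  # placeholder for real extraction


def read_cas_pdf_sketch(path):
    """Dispatcher: one open, shared by every stage."""
    with use_doc(path) as doc:
        return {
            "file_type": detect_file_type(path, _doc=doc),
            "holdings": parse_holdings(path, _doc=doc),
            "investor_info": extract_investor(path, _doc=doc),
        }
```

Each stage still works standalone (it opens and closes its own handle when `_doc` is omitted), while the dispatcher path pays for a single open.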
Replaces the v0.8 pdfminer / PyMuPDF test files with a per-issuer e2e suite plus a focused unit-test layer.

Layout
======

- `tests/conftest.py`: module-scoped fixtures for each fixture PDF (CAMS / KFin / NSDL / CDSL detailed + summary). Each fixture loader skips its dependent tests when the corresponding env var isn't set, so contributors without the encrypted bundle can still run the unit-test portion.
- `tests/_assertions.py`: invariant helpers shared across the e2e suite. Designed to lock in correctness without encoding the real rupee figures from private fixtures.
- `tests/test_cams.py`, `test_kfin.py`, `test_nsdl.py`, `test_cdsl.py`: per-issuer e2e tests.
- `tests/test_errors.py`: error-path + back-compat shim tests.
- `tests/test_demat_units.py`: NSDL/CDSL parser unit tests using synthetic Block/Cell fixtures (no real ISINs, names, or IDs).
- `tests/test_helpers.py`, `tests/test_gains.py`, `tests/test_gains_e2e.py`: existing helper / gains coverage, retained.

Arithmetic invariants
=====================

The e2e tests verify parsing correctness without depending on specific rupee amounts:

- **CAMS / KFin DETAILED**: scheme.close * scheme.valuation.nav == scheme.valuation.value and scheme.open + sum(txn.units) == scheme.close.
- **NSDL / CDSL**: sum(eq.value + mf.value + bd.value) == account.balance per account; mf.balance * mf.nav == mf.value; bond.num_bonds * bond.face_value == bond.value (summary form); bond.num_bonds * bond.market_price == bond.value (detailed form).

These catch column-swap, decimal-parse, anchor-drift, and missed-transaction bugs without encoding portfolio totals in the repo.

Removed
=======

- `tests/test_pdfminer.py`, `tests/test_mupdf.py`, `tests/test_process.py`: backend-specific suites for the v0.8 stack.
- `tests/test_pypdfium.py`, `tests/base.py`: the intermediate single-file test suite is superseded by the per-issuer split.
- `tests/pytest.ini`: empty file masked the pyproject.toml pytest config.
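
As a rough illustration of the arithmetic invariants above, a helper in the spirit of `tests/_assertions.py` could be written as follows. The `Scheme` dataclass, the function names, and the 0.01 rounding tolerance are all assumptions made for this sketch (the commit message states the invariants as exact equalities), and the figures are synthetic.

```python
from dataclasses import dataclass, field
from decimal import Decimal

TOL = Decimal("0.01")  # assumed rounding tolerance; not taken from the PR


@dataclass
class Scheme:
    """Hypothetical flattened view of a parsed scheme; field names are assumed."""
    open: Decimal
    close: Decimal
    nav: Decimal
    value: Decimal
    txn_units: list[Decimal] = field(default_factory=list)


def scheme_violations(s: Scheme, tol: Decimal = TOL) -> list[str]:
    """Return the CAMS / KFin DETAILED invariants that the scheme breaks."""
    problems = []
    if abs(s.open + sum(s.txn_units, Decimal(0)) - s.close) > tol:
        problems.append("open + sum(txn.units) != close")
    if abs(s.close * s.nav - s.value) > tol:
        problems.append("close * nav != value")
    return problems


def account_violations(
    holding_values: list[Decimal], balance: Decimal, tol: Decimal = TOL
) -> list[str]:
    """Return a violation if the holdings do not add up to the account balance."""
    if abs(sum(holding_values, Decimal(0)) - balance) > tol:
        return ["sum(holding values) != account.balance"]
    return []


# Synthetic figures only: a consistent scheme, then a column swap (nav <-> value).
good = Scheme(
    Decimal("0"), Decimal("150.000"), Decimal("10.50"), Decimal("1575.00"),
    [Decimal("100"), Decimal("50")],
)
swapped = Scheme(
    Decimal("0"), Decimal("150.000"), Decimal("1575.00"), Decimal("10.50"),
    [Decimal("100"), Decimal("50")],
)
```

Here `scheme_violations(good)` comes back empty while the swapped columns trip the `close * nav` check, which is the kind of bug the invariants are meant to surface without any real portfolio total appearing in the repo.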
- `pyproject.toml`: bumps version to 1.0.0, drops pdfminer.six (AGPL-3.0+) and PyMuPDF (GPL-3.0+ / commercial) from runtime deps, replaces with pypdfium2 (Apache-2.0 / BSD-3). Loosens remaining version bounds where compatible (click <10, rich <16, pypdfium2 <7, pydantic <3, etc.) and refreshes the dev-group upper bounds (pytest <10, pytest-cov <8, ipython, coverage).
- Python floor lifts to 3.11 (3.10 reaches EOL in October 2026 anyway).
- `uv.lock`: regenerated against the new dep set.
- `.github/workflows/run-pytest.yml`: switches CI to Python 3.12, decrypts `tests/files.enc` for the encrypted fixture bundle, and exposes the per-fixture env-var matrix to pytest. PyPI publish workflow updated to drop the removed backends.
- `licenses/AGPL-3.0+.txt` + `licenses/GPL-3.0+.txt`: removed; no longer required to redistribute since the GPL/AGPL deps are gone.
- `README.md`: documents the v1.0 backend swap and refreshes external links. `CHANGELOG.md` gets a 1.0.0 section.
cd21b91 to
01e87c1
v0.9.0 shipped a PyMuPDF-1.25 compatibility fix on top of v0.8's
existing backend; v1.0 has already replaced that backend with
pypdfium2, so the v0.9 parser patches don't apply. The merge keeps
v1.0's parser layer and folds in the v0.9 metadata changes that
are still relevant:
- `casparser-isin>=2026.5.1` (DB format v2 with sebi_category /
last_seen / ISIN-first lookup priority) — adopted.
- `pdfminer.six` and the `mupdf`/`fast` PyMuPDF extras stay
removed (1.0.0's pure-pypdfium2 stack).
- `MutualFund.fix_float` aliased-field bug fix (v0.9 patched it
on the v0.8 model; v1.0's model already carries the same fix).
- CI matrix: adopt v0.9's `[3.11, 3.12, 3.13]` Python matrix; drop
`--all-extras` from `uv sync` (no extras to install any more).
- CHANGELOG keeps the 1.0.0 entry on top and a condensed 0.9.0
entry below it for historical record.
Files v0.9 modified that v1.0 had already deleted are kept deleted:
- casparser/parsers/mupdf.py
- casparser/process/{__init__,cas_detailed,cas_summary,cdsl_statement,
nsdl_statement,regex,utils}.py
Tests: 151/151 with private fixtures, 87/87 + 64 skipped without.
GitHub deprecation notice: actions running on Node.js 20 will be forced to Node.js 24 from 2026-06-02. Bump every referenced action to its current Node-24-native major:

- actions/checkout v4 → v6
- actions/setup-python v5 → v6
- astral-sh/setup-uv v5 → v8
- codecov/codecov-action v5 → v6

These are all backward-compatible at the workflow-input level; no input changes required.
b41c14a to
09773ac
Summary
v1.0 is a full rewrite of the CAS parsing layer:
- `mupdf`/`fast` extras and the `--force-pdfminer` CLI flag (`force_pdfminer=` kwarg kept as a no-op with `DeprecationWarning`).

Why
- `MutualFund.fix_float` validator miss on `Optional[Decimal]` aliased fields.

What's in the box
- `casparser/parsers/`:
  - `pageobj.py`: shared page-object atom extractor (NSDL/CDSL)
  - `extract.py`: char/line extractor (CAMS/KFin)
  - `cams_detailed.py`, `cams_summary.py`, `nsdl.py`, `cdsl.py`
  - `detect.py`: file-type sniffer; wraps `PdfiumError` into `CASParseError`/`IncorrectPasswordError`
  - `_classify.py`, `_isin.py`, `_investor.py`: shared helpers
- `investor_info`, `folio.PAN/KYC/PANKYC`, `scheme.isin/amfi/type`, `scheme.nominees`, `scheme.valuation.cost`.
- `investor_info` is now required on both `CASData` and `NSDLCASData` (matches the contract of "every CAS contains an investor block").
- `DIRECT` (non-ARN) distribution-mode rows populate PnL/return.
- `casparser-isin` with a direct-ISIN fallback path for templates where multi-line registrar rendering mangles the RTA token.

Bug fixes that landed alongside the rewrite
- `valuation.date` was mis-parsing to `date(201, 1, 1)`: column boundary + Pydantic coercion fix.

Breaking changes
- `casparser.types.CASData.investor_info`: `Optional[InvestorInfo]` → `InvestorInfo` (parser raises `CASParseError` if it can't find the block).
- `casparser.types.NSDLCASData.investor_info`: same change.
- `casparser.types.NSDLCASData.file_type`: `Optional[FileType] = None` → `FileType`.
- `ProcessedCASData` and `PartialCASData` removed from `casparser.types` (they were internal to the old pipeline).
- `casparser.process` package removed; surviving helpers moved to `casparser.parsers._classify` and `casparser.parsers._isin`.
- `--force-pdfminer` / `force_pdfminer=` is a no-op (emits `DeprecationWarning`).

Testing
- `tests/test_pypdfium.py`, `tests/test_helpers.py`, `tests/test_gains.py`, `tests/casparser/test_cli.py`).

Test plan
- `casparser` CLI on a sample CAMS/KFin/NSDL/CDSL PDF
- `pip install -U casparser` (or `uv sync`) installs without pulling pdfminer.six / PyMuPDF
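
For callers still passing `force_pdfminer=`, the no-op-with-warning behaviour listed under Breaking changes could be implemented along these lines. This is a minimal sketch, not the actual shim: the function name `read_cas_pdf_sketch`, the `_UNSET` sentinel, the warning text, and the placeholder return value are all assumptions.

```python
import warnings

_UNSET = object()  # sentinel so an explicit force_pdfminer=False still warns


def read_cas_pdf_sketch(filename, password, *, force_pdfminer=_UNSET):
    """Accept the legacy kwarg, ignore its value, and warn when it is passed."""
    if force_pdfminer is not _UNSET:
        warnings.warn(
            "force_pdfminer is ignored: pypdfium2 is the only backend",
            DeprecationWarning,
            stacklevel=2,  # point the warning at the caller, not at this function
        )
    return {"filename": filename}  # placeholder for the parsed result
```

Old call sites keep working while test runs and `-W error::DeprecationWarning` surface the stale argument.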